Statistical Data Cloning for Machine Learning

Research Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

Author

  • GREGORY SHAKHNAROVICH
Abstract

This work is concerned with estimating a classifier’s accuracy. We first review some existing methods for error estimation, focusing on cross-validation and the bootstrap, and motivate the use of kernel-based smoothing for small sample sizes. We use the term data cloning to refer to the process of (re)sampling the data via the kernel-based smoothed bootstrap. A number of novel estimators based on cloning are presented. Finally, we extend our estimators to allow cloning of complex real-life data sets, in which a data point may include continuous, bounded, integer and nominal attributes. This allows for better classifier evaluation over heterogeneous real data repositories with a limited amount of data, such as the UCI repository. We use the root mean squared error (RMSE) as a measure of an estimator’s quality and support this choice with a probabilistic argument. Using this measure, we report on a set of 28 experiments in which the new cloning methods outperform cross-validation as well as the .632+ bootstrap, which, according to Efron and Tibshirani [13], is the estimator of choice. Although the proposed estimators require more computational effort than the established ones, the increased time complexity is within a constant factor of that of the relevant traditional estimators. Based on the motivation and the empirical results, we suggest that the cloning-based .632+ estimator is superior to the other estimators, and note bootstrapped cross-validation as the second choice.
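As a rough illustration of the cloning idea described in the abstract, the sketch below draws a smoothed-bootstrap sample from one-dimensional continuous data: each clone is an original point chosen with replacement, perturbed by kernel noise. The Gaussian kernel, the normal-reference bandwidth rule, and all function names are assumptions made for this example; the abstract does not specify the thesis's kernel or bandwidth choice, and the handling of bounded, integer and nominal attributes is not covered here.

```python
import random
import statistics


def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth 1.06 * s * n^(-1/5).

    Assumption for this sketch only; the thesis may choose its bandwidth differently.
    """
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-1 / 5)


def clone_sample(data, bandwidth, size=None, rng=None):
    """Draw a smoothed-bootstrap ("cloned") sample from 1-D continuous data.

    Each clone is an original point picked uniformly with replacement,
    plus zero-mean Gaussian noise with standard deviation `bandwidth`.
    With bandwidth 0 this reduces to the ordinary bootstrap.
    """
    rng = rng or random.Random()
    size = len(data) if size is None else size
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in range(size)]


def rmse(estimates, truth):
    """Root mean squared error of a list of estimates around a known true value."""
    return (sum((e - truth) ** 2 for e in estimates) / len(estimates)) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    data = [4.9, 5.1, 5.6, 6.0, 6.3, 6.8]  # hypothetical feature values
    h = rule_of_thumb_bandwidth(data)
    clones = clone_sample(data, h, size=10, rng=rng)
    print(clones)
```

In an error-estimation setting, a classifier would be trained and tested on such cloned samples in place of plain bootstrap resamples, and `rmse` would compare the resulting accuracy estimates against a known true error in a controlled experiment.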


Similar resources

Machine Learning Approaches to siRNA Efficacy Prediction

OF THESIS Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science Computer Science The University of New Mexico Albuquerque, New Mexico May, 2005 Machine Learning Approaches to siRNA Efficacy Prediction by Sahar Abubucker B.E., Madras University, 2000 M.S., Computer Science, University of New Mexico, 2005

Thesis Submitted in Partial Fulfillment of the requirement for the Degree of M.A/M. Sc In School consultant

Goal: The aim of this study is to assess and compare the emotional ability of deaf, semi-deaf and hearing students (aged 14-20) in Mashhad. Method: For this experiment, 105 students in total were selected randomly. From each group (hearing, deaf and semi-deaf), 35 boys and girls were chosen. This article is useful and explanatory. In this stud...

The Idea Of Using The Steganography As Encryption Tool

The increasing use of computers, networks, social networking and Internet applications has led to the wide spread of images, which makes them easy to compromise by attackers and by anyone who tries to change the information. So the need arises to transmit the information in a secure manner. Steganography is the best solution to sol...

Publication date: 2006